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Abstract 

Many applications in translational medicine require the understanding of how diseases progress 
through the accumulation of persistent events. Specialized Bayesian networks called monotonic 
progression networks offer a statistical framework for modeling this sort of phenomenon. Current 
machine learning tools to reconstruct Bayesian networks from data are powerful but not suited to 
progression models. We combine the technological advances in machine learning with a rigorous 
philosophical theory of causation to produce Polaris, a scalable algorithm for learning progression 
networks that accounts for causal or biological noise as well as logical relations among genetic events, 
making the resulting models easy to interpret qualitatively. We tested POLARIS on synthetically 
generated data and showed that it outperforms a widely used machine learning algorithm and 
approaches the performance of the competing special-purpose, albeit clairvoyant algorithm that is 
given a priori information about the model parameters. We also prove that under certain rather 
mild conditions, POLARIS is guaranteed to converge for sufficiently large sample sizes. Finally, 
we applied Polaris to point mutation and copy number variation data in Prostate cancer from 
The Cancer Genome Atlas (TCGA) and found that there are likely three distinct progressions, 
one major androgen driven progression, one major non-androgen driven progression, and one novel 
minor androgen driven progression. 
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1. Introduction 

Modern data science focuses on scientific problems that are replete with high dimensional 
data, with the data-dimension approaching the sample size. This situation has become 
often too common in biology, biomedicine and social sciences. Typically, data are collected 
and summarized in a joint distribution and some useful patterns are extracted from the 
distribution. Graphical models, in particular Bayesian networks [13, 16, 21], succinctly 
represent these joint distributions and extract the statistical dependencies between the 
variables, effectively filtering out indirect relationships to expose the underlying structure 
of interactions in the system. While this approach is widely applicable, it fails to provide 
the kind of information needed for many clinical problems, such as survival prediction, 
therapy design and drug resistance in cancer. These problems would benefit greatly from 
models that describe a temporal ordering of events describing a progressive process. 

Here, the useful information lies in asymmetric relationships, such as causality and prece- 
dence, and not necessarily symmetric ones such as correlation. Several research groups have 
produced statistical progression models of varying complexities but remain disconnected 
from the technological advances made in the machine learning community, particularly in 
structure learning and conditional inference in graphical models. 

In this paper, we present a novel framework to synthesize recent advances in graphical 
models with a sound and rigorous theory of causality. Namely, we consider the set of 
probabilistic logical conditions underlying Suppes's probabilistic causation theory [19] to 
identify positive prima facie causes. In other words, we look for a cause C modifying the 
effect E, positively, by C being temporally prior to E and C raising the probability of E; 
not all positive prima- facie causes are genuine. In this paper, we show how to translate 
these probabilistic logical conditions into the standard regularized, maximum likelihood 
score of Bayesian networks, and devise a score based machine learning algorithm, Polaris 
(Progression mOdel LeARnlng Score), to extract the underlying progression and causal 
structures although the data are non-temporal and cross sectional. 

The second novelty for Polaris is in the way it handles what one may choose to describe 
as a causal noise, which accounts for the net effect of unmodeled (usually, minor and/or 
rare) causes on an event in the absence of the event's canonical causes. This model thus 
differs from most statistical models of progression, which focus on observational noise, or 
the effects of mislabeling the occurrence of an event, in either direction. Last but not 
least, Polaris tackles a wider range of causal relations, naturally including all that can 
be described with a probabilistic boolean logic. This capability makes the resulting model 
easily interpretable with phenomenological statements such as "the presence of EGFR and 
MYC mutations causes a mutation in P53 but a mutation in either gene alone does not.'''' 

The rest of the paper is structured as follows. It starts with a technical description 
of graphical models and progression models, some approaches to structure learning for 
both types of models, and a perspective on the limitations of existing structure learning 
algorithms for progression models. It then describes the development of our algorithm, 
Polaris, grounded on its philosophical roots which lead to its mathematical definitions 
and ultimately, to its practical implementation. It follows this section with theoretical 



Downloaded from http://biorxiv.org/on September 18, 2014 



3 

convergence results in the case of sufficiently large samples and a demonstration of its 
practical performance across many realistic data sizes. The next section illustrates how 
Polaris works, with an application to a practical example in Prostate cancer, while pro- 
ducing some novel hypotheses for its progression. The paper concludes with a discussion 
of various related issues. 

2. Model Descriptions and structure learning 

2.1. Bayesian Networks. A Bayesian network (BN) is a statistical model that provides a 
sparse and succinct representation of a multivariate probability distribution over n random 
variables and encodes it into a sparse directed acyclic graph (DAG), 1 G = (V,E) over 
n = \ V\ nodes, one per variable 2 , and \E\ <C \V\ 2 directed edges. The full joint distribution 
factors as a product of conditional probability distributions (CPDs) of each variable, given 
its parents in the graph. In a DAG, the set of parents of node Xi consists of all the nodes 
with edges that point to Xi and is written as Pa(Xi). In this paper, we represent CPDs 
as tables (see figure 1), in which each row represents a possible assignment of the parents 
and the corresponding probability of the child, here, a Bernoulli random variable G {0, 1}, 
when it takes the value 1. 

P(xi,...,Xn) = Y\ V{Xi= Xi\Pa(Xi) = x Pa ^). 

The set of edges E represents all the conditional independence relations between the 
variables. Specifically, an edge between two nodes Xi and Xj denotes statistical con- 
ditional dependence, no matter on which other variables we condition. Mathematically, 
this means that for any set of variables S C V \ {Xi,Xj}, it holds that V(Xi,Xj \ S) ^ 
V{Xi | S)V(Xj | S). In the BN, the symmetrical nature of statistical dependence means 
that the graphs Xi — > Xj and Xi Xj encode the same conditional independence rela- 
tions. We call two such graphs /-equivalent 3 and a set of such graphs a Markov equivalence 
class. In fact, any graphs that contain the same skeletons and u-structures are Markov 
equivalent. Here, the skeleton refers to the undirected set of edges, in which Xi — > Xj and 
Xi <— Xj both map to Xi o Xj, and a v -structure 4 refers to a node with a set of at least 
two parents, in which no pair of parents share an edge. 



A DAG consists of a set of nodes (V) and a set of directed edges (E) between these nodes, such that 
there are no directed cycles between any two nodes. 

2 In our setting, each node represents a Bernoulli random variable taking values in {0, 1}. By convention, 
we refer to variables with upper case letters (e.g. Xj) and the values they take with lower case letter (e.g. 

Xi). 

**J stands for independence here. 

4 In BN terminology, parent with no shared edge are considered "unwed parents." For this reason, the 
^-structure is often called an immorality. In other texts, it is referred to as an unshielded collider. 
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2.2. Monotonic Progression Networks. We define a class of Bayesian networks over 
Bernoulli random variables called monotonic progression networks (MPNs) , a term coined 
in [7]. MPNs formally represent informal and intuitive notions about the progression of 
persistent events that accumulate monotonically, based on the presence of other persistent 
events 5 . The conditions for an event to happen are represented in the CPDs of the BN using 
probabilistic versions of canonical boolean operators, namely conjunction (A), inclusive 
disjunction (V), and exclusive disjunction (©), as well as any combination of propositional 
logic operators. Figure 1 shows an example of the CPDs associated with various operators. 

While this framework allows for any formula to define the conditions of the parent events 
conducive for the child event to occur, we chose a simpler design to avoid the complexity 
of the number of possible logical formulas over a set of parents. Namely, we define three 
types of MPNs, a conjunctive MPN (CMPN), a disjunctive MPN (DMPN 6 ), and an exclu- 
sive disjunction MPN (XMPN). The operator associated with each network type defines 
the logical relation among the parents that should hold for the child event to take place. 
Arbitrarily complex formulas can still be represented as new variables, whose parent set 
consists of the variables in the formula and whose value is determined by the formula itself. 
This design choice assumes that most of the relations in a particular application fall un- 
der one category, while all others are special cases that can be accounted for individually. 
Mathematically, the CPDs for each of the MPNs are defined below: 

CMPN: 

Pr{X = l\^Pa{X) <\Pa{X)\) < e, 

Pr{X = l\^Pa(X) = \Pa(X)\) > e. 

DMPN: 

Pr(X = Pa{X) = 0) < e, 

Pr(X = Pa(X) > 0) > e. 

XMPN: 

Pr{X = l\Y J Pa{X)^l) < e, 

Pr(X = 1| Pa{X) = 1) > e. 

The inequalities above define the monotonicity constraints specific to each type of MPN, 
given a fixed "noise" parameter e. When a particular event occurs despite the monotonicity 
constraint, we say that the sample is negative with respect to that event. If the event does 



5 In this text, we use the terms variable and event interchangeably. 
^Sometimes referred to as a semi-monotonic progression network (SMPN). 
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not occur or occurs in compliance with the monotonicity constraint, then it is a positive 
sample of that event. Note that in the case in which e = 0, the monotonicity constraints are 
deterministic and all samples are positive. By convention, we sometimes refer to the rows 
of a CPD as positive and negative rows and use Of to refer to the conditional probability 
of some positive row i and 9^ to refer to the conditional probability of some negative row 
i. 

Finally, we note that probabilistic logical relations encoded in Bayesian networks are not 
entirely new and have been studied in the artificial intelligence community as noisy-AND, 
noisy- OR, and noisy-XOR networks [16]. 

2.3. Structure learning. Many algorithms exist to carry out structure learning of general 
Bayesian networks. They usually fall into two families of algorithms [13], although several 
hybrid approaches have been recently proposed [4]. The first, constraint based learning, 
explicitly tests for pairwise independence of variables conditioned on the power set of the 
rest of the variables in the network. The second, score based learning, constructs a network 
to maximize the likelihood of the observed data, with some regularization constraints 
to avoid over-fitting. Because the data are assumed to be independent and identically 
distributed (i.i.d.), the likelihood of the data is the product of the likelihood of each datum, 
which in turn is defined by the factorized joint probability function described in section 2.1. 
For numerical reasons, log likelihood (LL) is usually used instead of likelihood, and thus 
the likelihood product becomes the log likelihood sum. 

In this paper, we build on the latter approach, specifically relying on the Bayesian 
Information Criterion (BIC) as the regularized likelihood score. The score is defined below: 

score B ic(D,G) = LL(D\G) - ^y^dim(G). 

Here, G denotes the graph (including both the edges and CPDs), D denotes the data, M 
denotes the number of samples, and dim(G) denotes the number of parameters in the CPDs 
of G. The number of parameters in each CPD grows exponentially with the number of 
parents of that node. For our networks over events, dim(G) for a single node X is 2 ][Pa<yX ^. 
Thus, the regularization term — dim(G) favors nodes with fewer parents or equivalently, 
graphs with fewer edges. The coefficient log M/2 essentially weighs the regularization term, 
such that the higher the weight, the more sparsity will be favored over "explaining" the 
data through maximum likelihood. Note that the likelihood is implicitly weighted by the 
number of data points, since each point contributes to the score. 

With sample size enlarging, both the weight of the regularization term and the "weight" 
of the likelihood increase. However, the weight of the likelihood increases faster than that 
of the regularization term 7 . Thus, with more data, likelihood will contribute more to the 
score. Intuitively, with more data, we trust our observations more and have less need for 
regularization, although this term needs never completely vanishes. 



Mathematically, we say that the likelihood weight increases linearly, while the weight of the regulariza- 
tion term logarithmically. 
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Statistically speaking, BIC is a consistent score [13]. In terms of structure learning, this 
property implies that for sufficiently large sample sizes, the network with the maximum 
BIC score is /-equivalent to the true structure, G* . From the discussion in 2.1, it is clear 
that G will have the same skeleton and ^-structures as G* , though nothing is guaranteed 
regarding the orientation of the rest of the edges. For most graphs, therefore, BIC cannot 
distinguish among G* plus all other possible graphs and thus is not sufficient for exact 
structure learning. In the case of BNs with structured CPDs, such as MPNs, it is possible 
to improve on the performance of BIC. For example, Farahani et al. modified the BIC score, 
as described in section 3.2 to drastically improve performance in learning the orientations 
of all edges. 

2.4. Observational vs Biological Noise. The notion of probabilistic logical relations 
among variables to represent disease progression has been developed in two families of 
models. These two approaches diverge in the treatment of noise, or equivalently, in how 
the model produces negative, or non-monotonic, samples. The first approach, represented 
initially by Beerenwinkel et al. [10] and more recently by Ramazzoti et al. [17], encodes 
a notion of experimental, or observational, noise, in which negative samples result from 
incorrect labeling of the events. In this model, each generated sample is initially positive 
in all variables and then may have several event values inverted, with a certain probability. 
The second approach, represented initially by Farahani et al. [7] and now by the work 
presented here, encodes biological or causal noise, in which negative samples result from 
the activation of events by some non-canonical causes, in the absence of canonical ones. 
In models like these, the level of noise corresponds to the probability that an event occurs 
despite the absence of its parents. 

Observational noise and biological noise have different statistical properties that affect 
how the model is learned. Namely, observational noise is often assumed to be unbiased and 
have a Gaussian distribution and thus by the strong law of large numbers, converges to zero 
for a sufficiently large number of observations. In contrast, biological noise is asymmetric 
and persists even with large sample sizes. One of the key consequences of these differences 
is the following: While the asymptotic marginal probabilities of the variables are the same 
for all levels of noise in the observational noise model, for biological noise, however, the 
marginal probabilities are very sensitive to the level of noise, irrespective of how large the 
sample size is. See section 4.5 for details on how this affects learning algorithm presented 
in this paper. 
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Figure 1. The Polaris algorithm accepts raw cross sectional genomic 
data and computes a causal progression model with logical relations among 
the variables. Initially (top left), each patient's tumor is sampled during 
surgery and sequenced afterwards. From the sequencing, we find that each 
tumor has genomic aberrations in certain genes and not others. Most genes 
will be common among the tumors, although some may be outliers (high- 
lighted in gray). This data is then projected into a high dimensional space 
(top right) and the genes' co-occurrence frequencies are encoded as a joint 
distribution over the gene variables. Polaris mines this data for causal 
relations (bottom right) and encodes the major causal progressions among 
the genes in a graphical model. The minor causes account for the outliers 
in the data and often reflect a varying spectrum in cancer types among the 
patients. These minor causes are averaged and collapsed into a causal or 
biological noise parameter in the model. Finally, many genomic events, for 
instance CDK mutation, seem to precipitate from the occurrence two or 
more events, for instance EGFR and MYC mutations. We provide a lan- 
guage for expressing this dependence (bottom left). Using the examples in 
the figure, we can allow CDK to occur only when both EGFR and MYC 
occur (CMPN), when either one occurs (DMPN), or when only one but not 
both occur (XMPN) . The examples of conditional probability distributions 
(CPDs) reflect these logical relations. 
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3. State of the art review OR Insufficiency of current methods 

In our background review, we only consider algorithms that learn progression networks 
with biological noise. Efficient and effective algorithms to learn models of observational 
noise have been developed and are described in the literature [14, 17]. Here, we consider 
global optimization of the BIC score, a representative algorithm for learning general BNs, 
and DiProg, the only algorithm specifically developed for learning CMPNs and DMPNs. 

3.1. BIC is not sufficient for exact structure learning. Structure learning for Bayesian 
networks has improved tremendously in the last decade. In particular, the problem of max- 
imum likelihood learning of Bayesian networks with only discrete and observable variables 
can usually be solved to optimality using integer programming and LP relaxation. More- 
over, regularized ML scores such as BIC have nice mathematical properties that guarantee 
asymptotic convergence to an /-equivalent structure. However, for most graphs, the opti- 
mal BIC score does not belong to one particular structure. In fact, it belongs to the group 
of structures defined by a Markov equivalence class 8 . Therefore, structure learning through 
BIC alone cannot distinguish between many structures, only one of which is the structure 
of the true generating graph. Therefore, BIC is insufficient for exact structure learning. 
However, for BNs with structured CPDs, such as the progression networks described in 2.2, 
it is possible to design a score for more accurate structure learning. 

3.2. DiProg algorithm outperforms BIC. Farahani et al. [7] proposed an algorithm, 
DiProg, for learning MPNs that outperforms BIC consistently. DiProg learns the structure 
by optimizing a modified BIC score through reduction to an integer program and LP relax- 
ation. The modification is in the ML parameter estimation of the conditional probabilities. 
Specifically, if the estimated parameter for P(X\ ^2Pa(X) ^ \Pa(X)\) is greater than e, 
then set it to e. This modification penalizes graph structures that result in non-monotonic 
conditional probability parameters. Although the authors do not provide a mathematical 
proof of convergence, it is empirically seen that most of the edges in the original network 
are learned in the correct orientation, given enough samples. 

3.3. DiProg is not sufficient for real data. The modification to the BIC score improves 
performance but relies on a priori knowledge of e, which is rarely available. In fact, the 
performance of DiProg depends strongly on knowing the correct level of noise (see figure 2). 
This limitation makes the algorithm unreliable for applications on real data, in which e 
cannot be known. In this paper, we present an alternative algorithm, Polaris, that 
learns MPNs and DMPNs without knowledge of e. We also show that Polaris performs 
significantly better than optimizing BIC and in most cases, better than DiProg with a 
random e. 

4. Developing Our Causal Score 

We present a score, namely, the one used in Polaris, that is statistically consistent, like 
BIC, and correctly orients edges based on the monotonicity of the progression relation, like 



'Skeleton and immoralities. 
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DiProg, but without knowing the parameter e a priori. The basic idea behind the score 
is a heuristic for the likelihood of each sample such that the likelihood reflects both the 
probability of the sample being generated from its CPD and the probability that the CPD 
obeys the monotonicity constraints of the true model. Of course, we cannot compute the 
latter without knowledge of e and thus rely on a nonparametric notion of monotonicity 
to estimate the underlying CPD. Below, we start with an explanation the development 
of Polaris and conclude with its philosophical foundations to its asymptotic convergence 
properties. 

4.1. Foundation in Suppes causality. We modeled our score after the asymmetrical 
portion, a, of the causal score, presented earlier in [14]. The authors based this part 
of the score on Suppes's theory of causality for distinguishing prima facie causes from 
non-causal correlations. Suppes stipulates two conditions for event C to cause event E. 
First, C must raise the probability of E. In the authors' statistical model, this means that 
V(E | C) > V{E | C). Second, C must precede E in time. Unfortunately, the authors' 
model, just like ours, has no notion of time and could not directly infer temporal priority. 
However, under the condition that C is the unique cause of E, it is necessary that C must 
appear every time E appears but not vice versa. Therefore, the number of occurrences of 
C must be larger than that of E. From this, it is easy to see that V[C) > V(E). In fact, 
this property of temporal priority also holds for conjunctions over several parents, as E 
will only appear when all its parents are present. 

They define their a score for a causal relation C — > E as pT^j^v^^j^l • They prove 
that this definition meets both the probability raising and temporal priority conditions 
explained above. In their paper [14], the authors only consider tree structured graphs, in 
which every node has at most 1 parent and at most 1 negative row in its CPD. Applied 
to an MPN, the true a value for each CPD must be strictly positive for each edge - a 
consequence of the constraint that V{E \ C) > V(E \ C) for all MPNs. Thus, when we 
consider several graphs to fit to observed data, an estimated a with a negative value (below 
a threshold) means that the corresponding CPD breaks the monotonicity constraint. On 
the other hand, an estimated a with a positive value (above a threshold) puts more faith in 
the legitimacy of that CPD. Otherwise, the interpretation of CPD is ambiguous. Justified 
by these intuitive observations, we claim that a serves as a faithful proxy for monotonicity 
in tree structured MPNs. 

4.2. Weighted Likelihood Without A Priori Knowledge of Model Parameters. 

In this work, we consider more general DAG structured models, in which CPDs can have 
more than one negative row. To handle this, we assign an a score to each row of the CPD, 
as defined below. We adopt the notation a x i to denote the a value corresponding to row i 
of the CPD of variable X. By our convention, denotes the probability of negative row 
i and 0+ the probability of the one 9 positive row of the CPD of X. 



This assumption is only true for CMPNs. We extend this notation to DMPNs and XMPNs later. 
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1, for a positive row; 

a+ a — 

for a negative row. 



ei+e: 



Thus, as argued earlier, a is now a heuristic for the monotonicity of each row of a CPD 
rather than the CPD as a whole. It follows that each negative sample has a corresponding 
a between —1 and 1. Thus, we weigh each negative sample by its a value to reflect our 
belief that its CPD row conforms to the monotonicity constraints. This strategy leads to 
CPDs with high monotonicity to be favored through their samples, whereas CPDs with 
poor monotonicity are penalized through their samples. Moreover, by handicapping the 
samples instead of the CPDs directly, we allow rows whose conditional probabilities were 
estimated with more samples to have a larger effect on the score. The resulting a- weighted 
likelihood score (score q,wl) for variable X given sample d is defined below, where 
and are empirical estimates of their respective parameters. Note that because of the 
indicator function in the exponent of the a term in the score, only the a term of the row 
that corresponds to the sample is used to weigh the likelihood. Specifically, if the sample 
is positive, the likelihood is not altered, whereas if the sample is negative, the likelihood is 
penalized in proportion to the a score for that sample's corresponding row. 

score aWL (X : d) = Pr (X = d x \Pa(X) = d Pa{X )) • ^p<x)-CPD x {% _ 

ie\CPD x \ 

Of course, the score we use for structure learning includes the BIC regularization term, 
so the full combined score for a single variable X given a datum d is below. The last line 
defines the composed score for the all the variables, V, over all the data, D. 

score aWL mc (X : d) 

\CPD X \ 

log Pr(X = d x \Pa(X) = d Pa(x) ) ■ J] a ^<^= CPD *W 



XI 

1=1 



!ogM 



dim(X\Pa(X)), 



2 
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\CPD X \ 

Pr(X = d x \Pa(X) = d Pa{x) ) + HdpaiX) = CPD x (i)) log a x 
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2 
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= LL(d x ,d Pa(x) \G) + a(X\d) 
score aWL BIC (G : D) 

= LL(D\G) + 

deDX&v 

For brevity, we use the shorthand 

a(X\d) = Yl Mdpai*) = CPD x (i)) log a xl . 
ie\CPD x \ 

In other words, it is the a of the row of the CPD of X that corresponds to dp 0 pn- 

4.3. Multiplicative factor improves performance and makes certain asymptotic 
guarantees. Asymptotically, the BIC is known to reconstruct the correct skeleton and 
orient edges in immoralities correctly. Since we desire our score to enhance this result 
further and orient the remaining edges correctly without disturbing the correct skeletal 
structure, we introduce a new weight to the whole monotonicity term of the score. This 
weight is structured to approach zero in the limit, as the sample size approaches infinity. 
Thus, for small sample sizes, the monotonicity component will play a larger role in the 
overall score. Then, as the BIC component converges to a more stable structure, the 
monotonicity component chooses the exact structure among several equally likely ones. 
For these asymptotic results, we chose the simplest weight that is inversely proportional 
to the sample size: 1 /m. The final score we developed for structure learning of MPNs is 
below. 

score Po i aris (G : D) 

= LL{D\G) + ^Y,T, a ^ X \ d )- l ^ dim ^)- 

deDX&v 

We prove mathematically that this score asymptotically learns the correct exact struc- 
ture of an MPN under certain conditions - especially, conditions enforcing the absence of 
transitive edges and a sufficiently low e parameter. In practice, however, we found that 
our algorithm converges on the correct structure for graphs with transitive edges and non- 
negligible e values (see figure 2). 

Definition 1 (Faithful temporal priority). In a monotonic progression network G, if 
there exists a path from Xj to X%, then the temporal priority between Xi and Xj is faithful 
ifV(Xj)>V(Xi). 

Theorem 1 (Convergence conditions for Polaris). For a sufficiently large sample size, 
M , under the assumptions of no transitive edges and faithful temporal priority relations 



logM 

— - — dim(X\Pa(X)), and, finally 
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(see Definition 1) between nodes and their parents at least for nodes that have exactly 1 
parent, optimizing POLARIS converges to the exact structure. □ 
See supp. mat. for a complete proof. 

4.4. Extension to DMPNs and XMPNs. The score stated in the previous section 
works for all three classes of MPNs, with minor modifications to the definition of a, de- 
pending on the monotonicity constraints. The main difference between CMPNs and the 
other two types lies in the fact that each CPD corresponding to a CMPN has exactly one 
positive row. In contrast, the CPDs in DMPNs have exactly one negative row and the 
CPDs in XMPNs may have multiple positive and negative rows (see figure 1). Specifically, 
the only negative row for DMPNs is the case in which all parent nodes equal zero. For 
XMPNs, any row with exactly one parent event equal to one is a positive row and all the 
rest are negative rows. In order to extend the definition of a to DMPNs and XMPNs, we 
treat all events that correspond to the positive rows of a CPD as one event. The probability 
of this large event is called 9 + , just as in the CMPN case, and it is defined below for both 
DMPNs and XMPNS. 

OdmpnW = vC£Pa(X)>0), 
&xmpn(X) = P(J2MX) = 1). 
With these alternative constructions of 6, a is well defined for all three types of MPNs. 

4.5. Temporal Priority in the Presence of Biological Noise. The a score for learning 
models in [14] and [17] enforces both probability raising and, for conjunctive or singleton 
parent sets, temporal priority. The model of noise considered there has the property 
that, for sufficient large sample sizes, by the large of large numbers, the probability of a 
negative sample approaches zero. However, in our model of noise, #~'s are fixed parameters 
and do not approach zero. Thus, temporal priority cannot always be correctly imputed 
for all causal relations. That is, C — > E does not necessary mean that V{C) > V(E). 
Instead, temporal priority is decided by e, 9 + and the marginal probabilities, as specified 
in the equation below. Specifically, high e and correspondingly high 6~ , low 9 + and close 
marginal probabilities all make it easier to reverse the observed temporal priority. 

V{X) = V(Pa(X) = l)-9 + + £(1 - V(Pa(X) = CPD x {i))) ■ 0~ 

i 

Note that in this context, 6 + is uniquely defined, as we assume either a conjunctive or 
singleton parent set, and the sum is only over the negative rows of the CPD. Asymptotically, 
this score works just as well for DMPNs and XMPNs as it does for CMPNs for graphs 
without transitive connections. This is because, in the proof of Theorem 1, temporal 
priority must only hold for nodes with exactly one parent, and in that case, the three 
monotonicity constraints are indistinguishable. 
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5. MPN STRUCTURE LEARNING WITH POLARIS 

In this section, we describe and analyze the algorithm that uses the Polaris score to 
learn MPN structure. 

5.1. a Filtering. Before optimizing the score, there are certain parent sets that one may 
wish to eliminate as hypotheses. This pre-optimization filtering is done for two reasons. 
First, it prevents the optimization algorithm from selecting a spurious parent set. Second, 
it speeds up computation significantly by not computing the full score for that hypothet- 
ical parent set. We use the a score to filter hypotheses, rejecting those solutions that 
create a negative a for at least one row of the CPD. This a-filter is used for all types of 
MPNs and greatly improves efficiency without eliminating too many true hypotheses. In 
fact, we proved mathematically that asymptotically, the a filter will be free of any mistakes. 

Lemma 1 (Convergence of a-filter). For a sufficiently large sample size, M, the a-filter 
produces no false negatives for CMPNs, DMPNs, and XMPNs. □ 
See supp. mat. for a proof. 

5.2. Optimizing the score with GOBNILP. After pruning the hypothesis space with 
the a filter, we use GOBNILP [6, 1, 12], a free, publicly available BN structure learning 
package, to find the the network with the highest Polaris score. Given an upper bound on 
the maximum number of parents (by default, 3), GOBNILP expects as input the scores for 
each node given each possible combination of parents. For each node, our code produces 
this information with a depth first search through the power set of the rest of the nodes in 
the graph. Any hypothetical parent set that is filtered is simply not included as a possible 
solution for that node in the input to GOBNILP. 

6. Results 

6.1. Performance on Synthetic Data. We conducted several experiments to test the 
performance of Polaris on data generated from synthetic networks, all on ten variables. 
The network topologies were generated randomly, and the CPDs were generated according 
to the monotonic constraints imposed by the type of MPN and the value of e. These 
networks were sampled with different sample sizes. In all experiments, the performance 
metrics were measured over fifty synthetic topologies sampled ten times, for each value of 
e and sample size. 

We compared the performance of Polaris against two standards, the optimization of 
the BIC score and the clairvoyant 10 DiProg algorithm, across a variety of biologically and 
clinically realistic e values and sample sizes. To evaluate the performance of each algorithm, 
we measured both the recall, the fraction of true edges recovered, and the precision, the 
fraction of recovered edges that are true. We placed detailed figures for recall and precision 
at realistic sample sizes as well as asymptotic sample sizes for CMPNs, DMPNs, and 
XMPNs in the supplementary material. In figure 2, we summarize these results concisely 



By clairvoyant, we mean that the algorithm has a priori knowledge of e. 
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for all three types of MPNs by using AUPR, or the area under the precision-recall curve, 
as our performance metric. We expected Polaris to perform significantly better than 
BIC, which is nonspecific for monotonic relations and slightly worse than the clairvoyant 
DiProg algorithms, as Polaris does not have access to the correct value of e. The results 
showed this exact trend for recall, precision, and AUPR. The gap between the clairvoyant 
DiProg and Polaris remained consistent across all parameter values and relatively low, 
as opposed to the gap between Polaris and BIC optimization 

Finally, we considered the performance of Polaris against a non-clairvoyant DiProg by 
passing DiProg one of fifty randomly sampled values of e. Because of the cost of running 
DiProg fifty times, we limited our model type to CMPN, e to 0.15, and sample size to 
200. The box plot in figure 2 shows the variance of performance for Polaris, the average 
performance of the non-clairvoyant DiProg, the performance of the non-clairvoyant DiProg 
with the most incorrect value of e, and finally, the performance of the clairvoyant DiProg. 
Again using AUPR as the performance metric, we found that the average performance of 
the non-clairvoyant DiProg had a significantly lower mean and considerably larger variance 
than those of Polaris. Moreover, the mean of the worst case performance of DiProg 
was almost twice as low as that of Polaris, and the variance was slightly larger. From 
these analyses, we conclude that when e is not known, we expect more accurate and more 
consistent results from POLARIS than from DiProg. 

In the supplementary material, we also demonstrate the efficacy and accuracy of the a- 
filter for CMPNs, DMPNs, and XMPNs. On average, the filter eliminates approximately 
half of all possible hypotheses and makes considerably less than one mistake per network. 
In fact, for sufficiently large sample sizes, the false negative rate drops to almost zero. 
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Figure 2. We tested the performance of Polaris against the optimization 
of a standard symmetric score, BIC, and a clairvoyant algorithm for learning 
MPNs, DiProg. We tested each algorithm across several different levels of 
noise (0% to 30%) and across several realistic number of training samples 
(50 to 500). In each case, the network contained ten variables, common 
for progression models, although each algorithm can handle a great deal 
more. The three surface plots show the performance of each algorithm for 
different MPN types, CMPN on the top left, DMPN on the bottom left and 
XMPN on the bottom right. The box plots on the top right demonstrate the 
dependence of DiProg performance on a priori knowledge of e. We learned 
a network with ten variables, 15% noise and 200 samples with Polaris, 
DiProg with the correct e, and DiProg with a random e. The second column 
shows the average performance across the random e values, the third column 
shows the worst performance with a random e value, and finally, the fourth 
column shows the performance with knowledge of the correct e value. For 
all four plots, we measured the rate of both true positives (recall) and true 
negatives (precision) by computing the area under the precision-recall curve, 
or the AUPR. 
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6.2. Biological Example. We demonstrate the use of Polaris on prostate cancer (PCA) 
data. We conducted a literature search to find the genomic events most prominent in PCA 
and some theories about the ordering of these events. We limited our search to copy number 
variations (CNVs), mutations and fusion events, as these are believed to be persistent. 
From the experimental observations of the papers we found [9, 11, 22, 3, 20, 2, 18], we 
posit a progression model with 3 distinct sub- progressions. To test this theory, we learned 
a CMPN based on the copy number alteration (CNA), mutation, and fusion event data on 
the genes discussed above. We used the TCGA [15] prostate adenocarcinoma dataset of 
246 sequenced tumors, available through MSKCC's cBioPortal [5, 8] interface. 




Three Predicted Subtypes of Prostate 
Cancer Progression 
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FIGURE 3. Polaris was used to learn a CMPN model for prostate cancer. 
We selected the most commonly implicated oncogenes, tumor suppressor 
genes, and gene fusion events from the literature and used copy number 
variation and point mutation data from the TCGA database. Each edge is 
labeled with the fold change in the network score when the edge is left out. 
Based on the topology and our literature survey, we define three distinct 
progressions within the graph and each is labeled red, green or yellow. 
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We found that our learned model, shown in figure 3, validates and unifies the observations 
of the papers above in one tri-progression model. First, we found two major progressions, 
one centered around TMPRSS2-ERG fusion (below referred to as just "ERG') and an- 
other around CHD1 and SPOP. This confirms Barbieri et al.'s [3] theory of two distinct 
progressions defined by SPOP and ERG. Moreover, our model captures the associated 
genes Barbieri et al. predicted in each progression. Namely, CHD1, FOX03, and PRDM1 
are involved in the SPOP progression and PTEN and TP53 in the ERG progression. Next, 
we postulate that MYC, NCOA2 and NCOR2 are all involved in a third progression, even 
though NCOR2 appears isolated from the other two in the graph. We justify this decision 
by noting the observations of Grasso et al. [11], Taylor [20], Weischenfeldt et al. [22] and 
Gao et al. [9]. Grasso et al. predict that there is a third progression that includes neither 
CHD1 nor ERG. Taylor et al. predict that there is a subtype with poor prognosis that 
involves the amplification of MYC and NCOA2. Weischenfeldt et al. predict that early 
onset PCA involves the Androgen receptor (AR) pathway and NCOR2 mutation but does 
not include ERG, CHD1, or PTEN. Gao et al. show an experimental connection between 
MYC and AR expression, strengthening the MYC/NCOA2 involvement in the third path- 
way. Lastly, the figure shows several key driver genes [NKX3-1, APC, ZFH3, THSD7B, 
FOXP1, SHQL, RB, RYBP) in the progression of PCA that have not been assigned to 
either the SPOP or ERG progressions. The model proposes an assignment of these genes 
to their respective progressions that can be experimentally tested. As a sanity check, we 
note that FOXP1, SHQ1, and RYBP, all genes in the 3pl4 region, are closely related in 
the progression. 

7. Discussion and Future Work 

Polaris is a machine learning (ML) algorithm for discovering causal structure from 
data, founded on score-based graphical models, in which the score builds on classical prob- 
abilistic theory of causality developed by Suppes. Graphical models, in particular Bayesian 
networks, are by now extensively studied and well understood. There is an active commu- 
nity of researchers dedicated to developing powerful tools for efficient structure learning, 
parameter estimation, and conditional inference. Many such tools are publicly available as 
open source platforms and are continually evolving with data science applications: both 
in businesses and sciences. Polaris derives its power and flexibility from this eco-system 
of tools. Despite the abundance of these ML tools so far, practically all existing learning 
algorithms for graphical models have been ill-suited to the task of monotonic progression 
reconstruction. Polaris is able to uniquely tailor these algorithms to suit this particular 
task. Although, we are not the first ones to attempt to solve this problem (see Fahrani [7]), 
we are the first to devise a fully score based, non-clairvoyant algorithm (i.e., no prior knowl- 
edge of the parameters of the causal noise). In particular, we address causal or biological 
noise in a realistic manner, thus paving the way to real practical applications. 

Polaris accomplishes its intended tasks effectively and efficiently. To quantify its effi- 
cacy, we provide a theoretical analysis in the supplementary materials, containing a proof 
of its asymptotic convergence under some mild conditions. Moreover, we empirically tested 
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the algorithm extensively on a variety of noise levels and sample sizes. We found that it 
outperforms the standard score for structure learning and closely trails behind the clair- 
voyant one. We do not believe, however, that Polaris, by virtue of its machine learning 
abilities, can solely and completely solve all the underlying problems in cancer systems 
biology. It has several shortcomings: it is not yet the definitive algorithm to cover all 
notions of causality, some of which could be important to decipher a progressive disease 
like cancer; it depends on astute experimentalists and incisive experiments to provide those 
observations that underpin how the disease progresses; and finally, it relies on molecular 
biologists to interpret its output before it can be related to the mechanistic models of 
genes, expressions and signaling. What we have achieved instead are the following: an 
operationalized version of a rigorously developed theory of causality, now well integrated 
with the machine learning technology, and a useful tool for biologists interested in the 
progression of genomic events in cancer. 

In our future work, we will explore several of these shortcomings of the Polaris frame- 
work: we will develop more robust statistical estimators, and infer synthetically lethal 
interactions from data. With the latter, we will prioritize development of therapy-design 
tools based on progression models to guide cancer drug selection as well as discovery of 
concrete hypotheses for novel drug targets. 
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